Abstract

We address the problem of detecting deviations of a binary sequence
from randomness, which is very important for random number generators (RNG)
and pseudorandom number generators (PRNG). Namely, we consider
a null hypothesis $H_0$ that a given bit sequence is generated by a Bernoulli
source with equal probabilities of 0 and 1, and the alternative hypothesis
$H_1$ that the sequence is generated by a stationary and ergodic
source which differs from the source under $H_0$. We show that data
compression methods can be used as a basis for such testing and
describe two new tests for randomness, which are based on ideas of
universal coding. Known statistical tests and the suggested ones are
applied to testing PRNGs. These experiments show that the power of
the new tests is greater than that of many known algorithms.
1 Introduction



Randomness testing of random number and pseudorandom number generators
is used for many purposes, including cryptographic, modeling and
simulation applications; see, for example, Knuth, 1981; L'Ecuyer, 1994;
Maurer, 1992; Menezes and others, 1996. For such applications the required bit
sequence should be truly random, i.e., by definition, such a sequence could be
interpreted as the result of flips of a "fair" coin with sides that are
labeled "0" and "1" (for short, it is called a random sequence; see Rukhin and
others, 2001). More formally, we will consider the main hypothesis $H_0$ that a
bit sequence is generated by the Bernoulli source with equal probabilities of
0's and 1's. Associated with this null hypothesis is the alternative hypothesis
$H_1$ that the sequence is generated by a stationary and ergodic source which
generates letters from $\{0, 1\}$ and differs from the source under $H_0$.

In this paper we will consider some tests which are based on results and
ideas of Information Theory and, in particular, source coding theory.
First, we show that a universal code can be used for randomness testing.
(Let us recall that, by definition, a universal code can compress a sequence
asymptotically down to the Shannon entropy per letter when the sequence is
generated by a stationary and ergodic source.) If we take into account that the
Shannon per-bit entropy is maximal (1 bit) if $H_0$ is true and is less than
1 if $H_1$ is true (Billingsley, 1965; Gallager, 1968), we see that it is natural
to use this property and universal codes for randomness testing because, in
principle, such a test can detect any deviation from randomness which
can be described in the framework of the stationary and ergodic source model.
Loosely speaking, the test rejects $H_0$ if a binary sequence can be compressed
by the considered universal code (or data compression method).

It should be noted that the idea of using compressibility as a measure
of randomness has a long history in mathematics. The point is that, on the
one hand, the problem of randomness testing is quite important for practice,
but, on the other hand, this problem is closely connected with such deep
theoretical issues as the definition of randomness, the logical basis of
probability theory, randomness and complexity, etc.; see Kolmogorov, 1965; Li and
Vitanyi, 1997; Knuth, 1981; Maurer, 1992. Thus, Kolmogorov suggested
defining the randomness of a sequence, informally, as the length of the
shortest program which can create the sequence (if one of the universal Turing
machines is used as a computer). So, loosely speaking, the randomness (or
Kolmogorov complexity) of a finite sequence is equal to the length of its
shortest description. It is known that the Kolmogorov complexity is not computable
and, therefore, cannot be used directly for randomness testing. On the other hand,
each lossless data compression code can be considered as a method for upper
bounding the Kolmogorov complexity. Indeed, if $x$ is a binary word, $\varphi$ is
a data compression code and $\varphi(x)$ is the codeword of $x$, then the length of
the codeword $|\varphi(x)|$ is an upper bound for the Kolmogorov complexity of
the word $x$. So, again we see that the codeword length of a lossless data
compression method can be used for randomness testing.

In this paper we suggest tests for randomness, which are based on results 
and ideas of the source coding theory. 

Firstly, we show how to build a test based on any data compression
method and give some examples of applying such a test to PRNGs.
It should be noted that data compression methods have been considered as a
basis for randomness testing in the literature. For example, Maurer's Universal
Statistical Test, the Lempel-Ziv Compression Test and the Approximate Entropy
Test are connected with universal codes and are quite popular in practice;
see, for example, Rukhin and others, 2001. In contrast to known methods,
the suggested approach makes it possible to build a test for randomness
based on any lossless data compression method, even if the distribution law of
the codeword lengths is not known.

Secondly, we describe two new tests, conceptually connected with universal
codes. When both tests are applied, a tested sequence $x_1 x_2 \ldots x_n$ is divided
into subwords $x_1 x_2 \ldots x_s$, $x_{s+1} x_{s+2} \ldots x_{2s}$, $\ldots$, $s > 1$, and the hypothesis $H_0^*$ that
the subwords obey the uniform distribution (i.e. each subword is generated
with probability $2^{-s}$) is tested against $H_1^* = \neg H_0^*$. The key idea of the
new tests is as follows. All subwords from the set $\{0, 1\}^s$ are ordered and this
order changes after processing each subword $x_{js+1} x_{js+2} \ldots x_{(j+1)s}$, $j = 0, 1, \ldots$,
in such a way that, loosely speaking, the more frequent subwords have small
ordinals. When the new tests are applied, the frequencies of different ordinals
are estimated (instead of the frequencies of the subwords as in, say, the
chi-square test).

The natural question is how to choose the block length $s$ in such schemes.
We show that, informally speaking, the block length $s$ should be taken quite
large due to the existence of so-called two-faced processes. More precisely,
it is shown that for each integer $s^*$ there exists a process $\xi$ such that for
each binary word $u$ the process $\xi$ creates $u$ with probability $2^{-|u|}$ if the
length of $u$ ($|u|$) is less than or equal to $s^*$, but, on the other hand, the
probability distribution $\xi(v)$ is very far from uniform if the length of the
words $v$ is greater than $s^*$. (So, if we use a test with a block length $s \le s^*$,
the sequences generated by $\xi$ will look random, even though $\xi$ is far from
being random.)

The outline of the paper is as follows. In Section 2 the general method
for constructing randomness testing algorithms based on lossless data
compressors is described. Two new tests for randomness, which are based on
constructions of universal coding, as well as the two-faced processes, are
described in Section 3. In Section 4 the new tests are experimentally
compared with methods from "A statistical test suite for random and
pseudorandom number generators for cryptographic applications", which was
recently suggested by Rukhin and others, 2001. It turns out that the new tests
are more powerful than the known ones.

2 Data compression methods as a basis for 
randomness testing 

2.1. Randomness testing based on data compression 

Let $A$ be a finite alphabet and $A^n$ be the set of all words of length $n$
over $A$, where $n$ is an integer. By definition, $A^* = \bigcup_{n=1}^{\infty} A^n$ and $A^\infty$ is the
set of all infinite words $x_1 x_2 \ldots$ over the alphabet $A$. A data compression
method (or code) $\varphi$ is defined as a set of mappings $\varphi_n$ such that
$\varphi_n : A^n \to \{0, 1\}^*$, $n = 1, 2, \ldots$, and for each pair of different words
$x, y \in A^n$, $\varphi_n(x) \ne \varphi_n(y)$. Informally, it means that the code $\varphi$ can be
applied for compression of each message of any length $n$, $n > 0$, over the
alphabet $A$ and the message can be decoded if its code is known.

Now we can describe a statistical test which can be constructed based
on any code $\varphi$. Let $n$ be an integer and $H_0$ be the hypothesis that the words
from the set $A^n$ obey the uniform distribution, i.e., $p(u) = |A|^{-n}$ for each
$u \in A^n$. (Here and below $|x|$ is the length if $x$ is a word, and the number
of elements if $x$ is a set.) Let the required level of significance (or Type I
error) be $\alpha$, $\alpha \in (0, 1)$. The main idea of the suggested test is quite
natural: well compressed words should be considered non-random
and $H_0$ should be rejected. More exactly, we define the critical value of the
suggested test by

$$t_\alpha = n \log |A| - \log(1/\alpha) - 1. \qquad (1)$$
(Here and below $\log x = \log_2 x$.)
Let $u$ be a word from $A^n$. By definition, the hypothesis $H_0$ is accepted if
$|\varphi_n(u)| > t_\alpha$ and rejected if $|\varphi_n(u)| \le t_\alpha$. We refer to this as the test with critical value (1).

Theorem 1. For each integer $n$ and a code $\varphi$, the Type I error of the
described test is not larger than $\alpha$.

The proof is given in the Appendix.
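
The test with critical value (1) is easy to try with an off-the-shelf compressor. The sketch below is only an illustration under stated assumptions: it uses Python's standard `zlib` module as the code (this paper does not use zlib), treats bytes as letters so that $|A| = 256$, and the function names `critical_value` and `rejects_h0` are hypothetical.

```python
import math
import random
import zlib


def critical_value(n_letters: int, alphabet_size: int, alpha: float) -> float:
    """Critical value (1): t_alpha = n log|A| - log(1/alpha) - 1, logs base 2."""
    return n_letters * math.log2(alphabet_size) - math.log2(1.0 / alpha) - 1.0


def rejects_h0(data: bytes, alpha: float = 0.01) -> bool:
    """Reject H0 when the codeword length (in bits) is at most t_alpha."""
    # zlib is an assumed stand-in for the code; any lossless compressor would do.
    codeword_bits = 8 * len(zlib.compress(data, 9))
    return codeword_bits <= critical_value(len(data), 256, alpha)


if __name__ == "__main__":
    rng = random.Random(12345)
    noisy = bytes(rng.getrandbits(8) for _ in range(4096))
    print(rejects_h0(noisy))        # pseudorandom bytes: zlib should not shrink them
    print(rejects_h0(bytes(4096)))  # all-zero bytes compress heavily
```

Because the Type I error bound does not depend on the compressor, swapping `zlib` for another lossless method changes only the power of the test, not its level of significance.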

Comment 1. The described test can be modified in such a way that
the Type I error will be equal to $\alpha$. For this purpose we define the set $A_\gamma$ by

$$A_\gamma = \{x : x \in A^n \ \&\ |\varphi_n(x)| = \gamma\}$$

and an integer $g$ for which the two following inequalities are valid:

$$\sum_{j=0}^{g} |A_j| \le \alpha |A|^n < \sum_{j=0}^{g+1} |A_j|. \qquad (2)$$

Now the modified test can be described as follows:

If for $x \in A^n$ we have $|\varphi_n(x)| \le g$, then $H_0$ is rejected; if $|\varphi_n(x)| > (g + 1)$, then
$H_0$ is accepted; and if $|\varphi_n(x)| = (g + 1)$, the hypothesis $H_0$ is accepted with
the probability

$$\Big(\sum_{j=0}^{g+1} |A_j| - \alpha |A|^n\Big) / |A_{g+1}|$$

and rejected with the probability

$$1 - \Big(\sum_{j=0}^{g+1} |A_j| - \alpha |A|^n\Big) / |A_{g+1}|.$$

(Here we used a randomized criterion; for the definition see, for example, Kendall
and Stuart, 1961, part 22.11.) We refer to this as the randomized test.

Claim 1. For each integer $n$ and a code $\varphi$, the Type I error of the
described test is equal to $\alpha$.
The proof is given in the Appendix.

We can see that this criterion has the level of significance (or Type I error)
exactly $\alpha$, whereas the first criterion, which is based on critical value (1), has
a level of significance that could be less than $\alpha$. In spite of this drawback,
the first criterion may be more useful due to its simplicity. Moreover, such an
approach makes it possible to use a data compression method $\varphi$ for testing
even when the distribution of the length $|\varphi_n(x)|$, $x \in A^n$, is not known.
Comment 2. We have considered codes for which different words of
the same length have different codewords. (In Information Theory such
codes are sometimes called non-singular.) Quite often a stronger restriction is
required in Information Theory. Namely, it is required that each sequence
$\varphi_n(x_1)\varphi_n(x_2) \ldots \varphi_n(x_r)$, $r > 1$, of encoded words from the set $A^n$, $n > 1$, can be
uniquely decoded into $x_1 x_2 \ldots x_r$. Such codes are called uniquely decodable.
For example, let $A = \{a, b\}$; the code $\varphi_1(a) = 0$, $\varphi_1(b) = 00$ is, obviously,
non-singular, but is not uniquely decodable. (Indeed, the word 000 can be
decoded as both $ab$ and $ba$.) It is well known in Information Theory that if a
code $\varphi$ is uniquely decodable, then the following Kraft inequality is valid:

$$\sum_{u \in A^n} 2^{-|\varphi_n(u)|} \le 1; \qquad (3)$$

see, for example, Gallager, 1968.

If it is known that the code is uniquely decodable, the suggested critical
value (1) can be changed. Let us define

$$t_\alpha = n \log |A| - \log(1/\alpha). \qquad (4)$$

Let, as before, $u$ be a word from $A^n$. By definition, the hypothesis $H_0$ is
accepted if $|\varphi_n(u)| > t_\alpha$ and rejected if $|\varphi_n(u)| \le t_\alpha$. We refer to this as the
test with critical value (4).

Claim 2. For each integer $n$ and a uniquely decodable code $\varphi$, the Type
I error of the described test is not larger than $\alpha$.
The proof is given in the Appendix.

So, we can see from (1) and (4) that the critical value is larger if the
code is uniquely decodable. On the other hand, the difference is quite small
and (1) can be used without a large loss of test power even in the case of
uniquely decodable codes.

It should not be a surprise that the level of significance (or Type I error)
does not depend on the alternative hypothesis $H_1$, but, of course, the power
of a test (and the Type II error) will be determined by $H_1$.

The examples of testing by real data compression methods will be given 
in Section 4. 

2.2. Randomness testing based on universal codes. 
We will consider the main hypothesis $H_0$ that the letters of a given
sequence $x_1 x_2 \ldots x_t$, $x_i \in A$, are independent and identically distributed (i.i.d.)
with equal probabilities of all $a \in A$, and the alternative hypothesis $H_1$ that
the sequence is generated by a stationary and ergodic source which generates
letters from $A$ and differs from the source under $H_0$. (If $A = \{0, 1\}$, an i.i.d.
source coincides with a Bernoulli source.) The definition of the stationary and ergodic
source and the Shannon entropy of such sources can be found in Billingsley,
1965, and Gallager, 1968.

We will consider statistical tests, which are based on universal coding and 
universal prediction. First we define a universal code. 

By definition, $\varphi$ is a universal code if for each stationary and ergodic
source (or process) $\pi$ the following equality is valid with probability 1
(according to the measure $\pi$):

$$\lim_{n \to \infty} |\varphi_n(x_1 \ldots x_n)| / n = h(\pi), \qquad (5)$$

where $h(\pi)$ is the Shannon entropy. (Such codes exist; see Ryabko, 1984.)
It is well known in Information Theory that $h(\pi) = \log |A|$ if $H_0$ is true, and
$h(\pi) < \log |A|$ if $H_1$ is true; see, for example, Billingsley, 1965; Gallager, 1968.
From this property and (5) we can easily derive the following theorem.

Theorem 2. Let $\varphi$ be a universal code, $\alpha \in (0, 1)$ be a level of
significance and a sequence $x_1 x_2 \ldots x_n$, $n > 1$, be generated by a stationary ergodic
source $\pi$. If the test with critical value (1) described above is applied for testing $H_0$ (against
$H_1$), then, with probability 1, the Type I error is not larger than $\alpha$, and the
Type II error goes to 0 when $n \to \infty$.

So, we can see that each good universal code can be used as a basis for
randomness testing. But the converse proposition is not true. Let, for example,
there be a code whose codeword length per letter is asymptotically equal to
$(0.5 + h(\pi)/2)$ for each source $\pi$ (with probability 1, where, as before, $h(\pi)$ is the
Shannon entropy). This code is not good, because its codeword length does
not tend to the entropy, but, obviously, such a code could be used as a basis
for a test of randomness. So, informally speaking, the set of tests is larger
than the set of universal codes.

Note that closely related problems were considered by Bailey (1974), who
obtained many important results in this field.



3 Two new tests for randomness and two-faced processes



Firstly, we suggest two tests which are based on ideas of universal coding,
but they are described in such a way that they can be understood without any
knowledge of Information Theory.
3.1. The "book stack" test 

Let, as before, there be given an alphabet $A = \{a_1, \ldots, a_S\}$, a source
which generates letters from $A$, and the two following hypotheses: the source is
i.i.d. and $p(a_1) = \ldots = p(a_S) = 1/S$ ($H_0$), and $H_1 = \neg H_0$. We should test the
hypotheses based on a sample $x_1 x_2 \ldots x_n$, $n > 1$, generated by the source.
When the "book stack" test is applied, all letters from $A$ are ordered from
1 to $S$ and this order is changed after observing each letter $x_t$ according to
the formula

$$\nu^{t+1}(a) = \begin{cases} 1, & \text{if } x_t = a; \\ \nu^t(a) + 1, & \text{if } \nu^t(a) < \nu^t(x_t); \\ \nu^t(a), & \text{if } \nu^t(a) > \nu^t(x_t), \end{cases} \qquad (6)$$

where $\nu^{t+1}$ is the order after observing $x_1 x_2 \ldots x_t$, $t = 1, \ldots, n$, and $\nu^1$ is defined
arbitrarily. (For example, we can define $\nu^1 = \{a_1, \ldots, a_S\}$.) Let us explain (6)
informally. Suppose that the letters of $A$ make a stack, like a stack of books,
and $\nu^1(a)$ is the position of $a$ in the stack. Let the first letter $x_1$ of the word
$x_1 x_2 \ldots x_n$ be $a$. If it takes the $i_1$-th position in the stack ($\nu^1(a) = i_1$), then take
$a$ out of the stack and put it on the top. (It means that the order is changed
according to (6).) Repeat the procedure with the second letter $x_2$ and the
stack obtained, etc.
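
The update rule (6) can be sketched in a few lines. This is a minimal, naive illustration rather than the authors' implementation: letters are assumed to be the integers 1..S, the initial order is assumed to be 1, 2, ..., S, and the function name `book_stack_positions` is hypothetical.

```python
from typing import List, Sequence


def book_stack_positions(sample: Sequence[int], alphabet_size: int) -> List[int]:
    """Return the stack position of each letter x_t at the moment it is observed.

    Letters are the integers 1..alphabet_size; the initial order is assumed
    to be 1, 2, ..., alphabet_size (top of the stack first).
    """
    stack = list(range(1, alphabet_size + 1))
    positions = []
    for letter in sample:
        index = stack.index(letter)          # naive O(S) search
        positions.append(index + 1)          # positions are 1-based
        stack.insert(0, stack.pop(index))    # rule (6): move the letter to the top
    return positions


if __name__ == "__main__":
    print(book_stack_positions([3, 6, 3, 3, 6, 1, 6, 1], 6))
    # -> [3, 6, 2, 1, 2, 3, 2, 2]
```

Letters that recur soon after their last occurrence are found near the top of the stack, which is why a non-uniform source produces an excess of small positions.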

To understand the main idea of the suggested method, it helps to take
into account that, if $H_1$ is true, then frequent letters from $A$ (like frequently
used books) will have relatively small numbers (they will spend more time near
the top of the stack). On the other hand, if $H_0$ is true, the probability of
finding each letter $x_i$ at each position $j$ is equal to $1/S$.

Let us proceed with the description of the test. The set of all indexes
$\{1, \ldots, S\}$ is divided into $r$, $r \ge 2$, subsets $A_1 = \{1, 2, \ldots, k_1\}$,
$A_2 = \{k_1 + 1, \ldots, k_2\}$, $\ldots$, $A_r = \{k_{r-1} + 1, \ldots, k_r\}$. Then, using $x_1 x_2 \ldots x_n$, we
calculate how many $\nu^t(x_t)$, $t = 1, \ldots, n$, belong to a subset $A_k$, $k = 1, \ldots, r$. We
define this number as $n_k$ (or, more formally, $n_k = |\{t : \nu^t(x_t) \in A_k, t = 1, \ldots, n\}|$,
$k = 1, \ldots, r$). Obviously, if $H_0$ is true, the probability of the event
$\nu^t(x_t) \in A_k$ is equal to $|A_k|/S$. Then, using a "common" chi-square test, we
test the hypothesis $H_0$: $P\{\nu^t(x_t) \in A_k\} = |A_k|/S$, based on the empirical
frequencies $n_1, \ldots, n_r$, against $H_1 = \neg H_0$. Let us recall that the value

$$x^2 = \sum_{i=1}^{r} \frac{(n_i - n(|A_i|/S))^2}{n(|A_i|/S)} \qquad (7)$$

is calculated when the chi-square test is applied; see, for example, Kendall and Stuart,
1961. It is known that $x^2$ asymptotically follows the chi-square distribution
with $(r - 1)$ degrees of freedom ($\chi^2_{r-1}$) if $H_0$ is true. If the level of significance
(or Type I error) of the $\chi^2$ test is $\alpha$, $\alpha \in (0, 1)$, the hypothesis $H_0$ is accepted
when $x^2$ from (7) is less than the $(1 - \alpha)$-value of the $\chi^2_{r-1}$ distribution;
see, for example, Kendall and Stuart, 1961.
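
As a small illustration of statistic (7), the sketch below computes it from the counts $n_1, \ldots, n_r$ and the subset sizes $|A_1|, \ldots, |A_r|$. The function name `chi_square_statistic` and the sample counts are hypothetical, chosen only for the example.

```python
from typing import Sequence


def chi_square_statistic(counts: Sequence[int], subset_sizes: Sequence[int]) -> float:
    """Statistic (7): sum over subsets of (n_i - n|A_i|/S)^2 / (n|A_i|/S)."""
    n = sum(counts)          # sample length
    S = sum(subset_sizes)    # alphabet size, since the subsets partition {1..S}
    total = 0.0
    for n_i, size in zip(counts, subset_sizes):
        expected = n * size / S
        total += (n_i - expected) ** 2 / expected
    return total


if __name__ == "__main__":
    # Two subsets of three positions each; 7 hits in the first, 1 in the second.
    print(chi_square_statistic([7, 1], [3, 3]))  # -> 4.5
```

With two subsets the statistic is compared with a chi-square quantile for one degree of freedom; at the 0.01 level the tabulated value is about 6.635, so a statistic of 4.5 from only eight letters would not yet reject the hypothesis.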

We do not describe the exact rule for constructing the subsets $\{A_1, A_2,
\ldots, A_r\}$, but we recommend performing some experiments to find the
parameters which make the sample size minimal (or, at least, acceptable).
The point is that there are many cryptographic and other applications where
it is possible to carry out some experiments for optimizing the parameter
values and, then, to test the hypothesis on independent data. For example,
in the case of testing a PRNG it is possible to seek suitable parameters using
a part of the generated sequence and then to test the PRNG using a new part
of the sequence.

Let us consider a simple example. Let $A = \{a_1, \ldots, a_6\}$, $r = 2$, $A_1 =
\{1, 2, 3\}$, $A_2 = \{4, 5, 6\}$, $x_1 \ldots x_8 = a_3 a_6 a_3 a_3 a_6 a_1 a_6 a_1$. If $\nu^1 = (1, 2, 3, 4,
5, 6)$, then $\nu^2 = (3, 1, 2, 4, 5, 6)$, $\nu^3 = (6, 3, 1, 2, 4, 5)$, etc., and $n_1 = 7$, $n_2 = 1$. We
can see that the letters $a_3$ and $a_6$ are quite frequent and the "book stack"
indicates this nonuniformity quite well. (Indeed, the expected values of $n_1$ and
$n_2$ under $H_0$ equal 4, whereas the observed values are 7 and 1, respectively.)

Examples of practical applications of this test will be given in Section 4,
but here we make two notes. Firstly, we pay attention to the complexity of
this algorithm. The "naive" method of transformation according to (6) could
take a number of operations proportional to $S$, but there exist algorithms
which can perform all operations in (6) using $O(\log S)$ operations. Such
algorithms can be based on AVL-trees; see, for example, Aho, Hopcroft and Ullman,
1976.

The last comment concerns the name of the method. The "book
stack" structure is quite popular in Information Theory and Computer
Science. In Information Theory this structure was first suggested as a basis of
a universal code by Ryabko, 1980, and was rediscovered by Bentley, Sleator,
Tarjan and Wei in 1986, and Elias in 1987 (see also the comment of Ryabko (1987)
about the history of this code). In the English-language literature this code is
frequently called the "Move-to-Front" (MTF) scheme, as was suggested by
Bentley, Sleator, Tarjan and Wei. Now this data structure is used in caching
and many other algorithms in Computer Science under the name
"Move-to-Front". It is also worth noting that the book stack was first considered by
the Soviet mathematician M.L. Cetlin as an example of a self-adaptive system
in the 1960s; see Rozanov, 1971.

3.2. The order test 

This test is also based on changing the order $\nu^t(a)$ of alphabet letters, but
the rule of the order change differs from (6). To describe the rule we first
define $\lambda^{t+1}(a)$ as the count of occurrences of $a$ in the word $x_1 \ldots x_{t-1} x_t$. At each
moment $t$ the alphabet letters are ordered according to $\nu^t$ in such a way that,
by definition, for each pair of letters $a$ and $b$, $\nu^t(a) \prec \nu^t(b)$ if $\lambda^t(a) > \lambda^t(b)$
(letters with equal counts may be placed in any order).
For example, if $A = \{a_1, a_2, a_3\}$ and $x_1 x_2 x_3 = a_3 a_2 a_3$, the possible orders can
be as follows: $\nu^1 = (1, 2, 3)$, $\nu^2 = (3, 1, 2)$, $\nu^3 = (3, 2, 1)$, $\nu^4 = (3, 2, 1)$. In
all other respects this method coincides with the book stack. (The set of all
indexes $\{1, \ldots, S\}$ is divided into $r$ subsets, etc.)

Obviously, after observing each letter $x_t$ the value $\lambda^t(x_t)$ should be
increased and the order $\nu^t$ should be changed. It is worth noting that there
exist a data structure and an algorithm which allow the alphabet
letters to be kept ordered in such a way that the number of operations spent is constant,
independently of the size of the alphabet. This data structure was described
by Moffat, 1999 and Ryabko and Rissanen, 2003.
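
The ordering rule of the order test can be sketched as follows. This is a naive illustration and not the constant-time data structure cited above: it re-sorts the alphabet after each letter, assumes letters are the integers 1..S with initial order 1, 2, ..., S, and breaks ties by keeping the previous relative order (a choice made here only for definiteness). The function name `order_test_positions` is hypothetical.

```python
from typing import List, Sequence


def order_test_positions(sample: Sequence[int], alphabet_size: int) -> List[int]:
    """Return the ordinal of each letter x_t in the frequency order at time t."""
    counts = {a: 0 for a in range(1, alphabet_size + 1)}
    order = list(range(1, alphabet_size + 1))  # assumed initial order
    positions = []
    for letter in sample:
        positions.append(order.index(letter) + 1)  # 1-based ordinal of x_t
        counts[letter] += 1
        # Stable sort: more frequent letters first, ties keep their old order.
        order.sort(key=lambda a: -counts[a])
    return positions


if __name__ == "__main__":
    print(order_test_positions([3, 2, 3], 3))  # -> [3, 3, 1]
```

The resulting ordinals are then grouped into subsets and counted exactly as in the book stack test.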

3.3. Two-faced processes and the choice of the block length
for a process testing

There are quite a few methods for testing $H_0$ against $H_1$ where the bit
stream is divided into words (blocks) of length $s$, $s > 1$, and the sequence
of blocks $x_1 x_2 \ldots x_s$, $x_{s+1} \ldots x_{2s}$, $\ldots$ is considered as letters, where each
letter belongs to the alphabet $B^s = \{0, 1\}^s$ and has the probability $2^{-s}$ if
$H_0$ is true. For instance, both tests described above, the methods from Ryabko,
Stognienko and Shokin (2003) and quite a few other algorithms are of
this kind. That is why the question of choosing the block length $s$ will be
considered here.

As was mentioned in the introduction, there exist two-faced processes
which, on the one hand, are far from being truly random, but, on the other
hand, can be distinguished from truly random ones only when
the block length $s$ is large. From the information-theoretical point of view
the two-faced processes can be simply described as follows. For a two-faced
process which generates letters from $\{0, 1\}$, the limit Shannon entropy is
(much) less than 1 and, on the other hand, the $s$-order entropy ($h_s$) is
maximal ($h_s = 1$ bit per letter) for relatively large $s$.

We describe two families of two-faced processes $T(k, \pi)$ and $\bar{T}(k, \pi)$,
where $k = 1, 2, \ldots$ and $\pi \in (0, 1)$ are parameters. The processes $T(k, \pi)$ and
$\bar{T}(k, \pi)$ are Markov chains of connectivity (memory) $k$, which generate
letters from $\{0, 1\}$. It is convenient to define them inductively. The process
$T(1, \pi)$ is defined by the conditional probabilities $P_{T(1,\pi)}(0/0) = \pi$, $P_{T(1,\pi)}(0/1) =
1 - \pi$ (obviously, $P_{T(1,\pi)}(1/0) = 1 - \pi$, $P_{T(1,\pi)}(1/1) = \pi$). The process $\bar{T}(1, \pi)$
is defined by $P_{\bar{T}(1,\pi)}(0/0) = 1 - \pi$, $P_{\bar{T}(1,\pi)}(0/1) = \pi$. Assume that $T(k, \pi)$ and
$\bar{T}(k, \pi)$ are defined and describe $T(k + 1, \pi)$ and $\bar{T}(k + 1, \pi)$ as follows:

$$P_{T(k+1,\pi)}(0/0u) = P_{T(k,\pi)}(0/u), \quad P_{T(k+1,\pi)}(1/0u) = P_{T(k,\pi)}(1/u),$$

$$P_{T(k+1,\pi)}(0/1u) = P_{\bar{T}(k,\pi)}(0/u), \quad P_{T(k+1,\pi)}(1/1u) = P_{\bar{T}(k,\pi)}(1/u),$$

and, vice versa,

$$P_{\bar{T}(k+1,\pi)}(0/0u) = P_{\bar{T}(k,\pi)}(0/u), \quad P_{\bar{T}(k+1,\pi)}(1/0u) = P_{\bar{T}(k,\pi)}(1/u),$$
$$P_{\bar{T}(k+1,\pi)}(0/1u) = P_{T(k,\pi)}(0/u), \quad P_{\bar{T}(k+1,\pi)}(1/1u) = P_{T(k,\pi)}(1/u)$$

for each $u \in B^k$ (here $vw$ is the concatenation of the words $v$ and $w$). For
example,

$$P_{T(2,\pi)}(0/00) = \pi, \quad P_{T(2,\pi)}(0/01) = 1 - \pi, \quad P_{T(2,\pi)}(0/10) = 1 - \pi, \quad P_{T(2,\pi)}(0/11) = \pi.$$
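
The inductive definition translates directly into a short recursive function. The sketch below is an illustration only: the names `prob_zero` and `generate`, the string encoding of the context (oldest bit first) and the random choice of the first k bits are assumptions made for this example, not part of the paper.

```python
import random


def prob_zero(bar: bool, pi: float, context: str) -> float:
    """P(0 / context) for T(k, pi) (bar=False) or the companion process (bar=True).

    `context` holds the k preceding bits, oldest first, e.g. "10" for k = 2.
    """
    if len(context) == 1:
        p = pi if context == "0" else 1.0 - pi   # base case T(1, pi)
        return 1.0 - p if bar else p             # companion process swaps the values
    # Inductive step: a leading 0 keeps the process, a leading 1 switches it.
    if context[0] == "0":
        return prob_zero(bar, pi, context[1:])
    return prob_zero(not bar, pi, context[1:])


def generate(bar: bool, k: int, pi: float, length: int, seed: int = 0) -> str:
    """Sample `length` bits; the first k bits are chosen uniformly (an assumption)."""
    rng = random.Random(seed)
    bits = [rng.choice("01") for _ in range(k)]
    while len(bits) < length:
        context = "".join(bits[-k:])
        bits.append("0" if rng.random() < prob_zero(bar, pi, context) else "1")
    return "".join(bits[:length])


if __name__ == "__main__":
    print(generate(False, 1, 0.2, 40))  # mostly alternating bits
    print(generate(True, 1, 0.2, 40))   # long runs of equal bits
```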

The following theorem shows that the two-faced processes exist. 

Theorem 3. For each $\pi \in (0, 1)$ the $s$-order Shannon entropy ($h_s$) of the
processes $T(k, \pi)$ and $\bar{T}(k, \pi)$ equals 1 bit per letter for $s = 0, 1, \ldots, k$, whereas
the limit Shannon entropy ($h_\infty$) equals $-(\pi \log_2 \pi + (1 - \pi) \log_2 (1 - \pi))$.

The proof of the theorem is given in the Appendix, but here we consider
examples of "typical" sequences of the processes $T(1, \pi)$ and $\bar{T}(1, \pi)$ for $\pi$, say,

1/5. Examples are: 010101101010100101... and 000011111000111111000...

We can see that each sequence contains approximately one half 1's and one
half 0's. (That is why the first order Shannon entropy is 1 bit per letter.)
On the other hand, neither sequence looks truly random, because
they, obviously, have too long subwords like either 101010... or 000...11111...
(In other words, the second order Shannon entropy is much less than 1 bit per
letter.) Hence, if a randomness test is based on estimation of the frequencies of
0's and 1's only, then such a test will not be able to find deviations from
randomness.
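
As a worked instance of the limit entropy formula from Theorem 3, for the value $\pi = 1/5$ used in the examples above one obtains (rounded to three decimals):

$$h_\infty = -\left(\tfrac{1}{5} \log_2 \tfrac{1}{5} + \tfrac{4}{5} \log_2 \tfrac{4}{5}\right) \approx 0.464 + 0.258 = 0.722 \text{ bits per letter}.$$

So these sequences are, in the limit, compressible to roughly 72% of their length, even though the frequencies of 0's and 1's give no hint of this.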

So, if we return to the question of the block length of tests and take
into account the existence of two-faced processes, it seems that the block
length should be taken as large as possible. But this is not so. The following
informal consideration could be useful for choosing the block length. The
point is that statistical tests can be applied if words from the sequence

$$x_1 x_2 \ldots x_s,\ x_{s+1} \ldots x_{2s},\ \ldots,\ x_{(m-1)s+1} x_{(m-1)s+2} \ldots x_{ms} \qquad (8)$$

are repeated (at least a few times) with high probability (here $ms$ is the
sample length). Otherwise, if all words in (8) are unique (with high
probability) when $H_0$ is true, a sensible test cannot be constructed based on
a division into $s$-letter words. So, the word length $s$ should be chosen in
such a way that some words from the sequence (8) are repeated with high
probability when $H_0$ is true. So, now our problem can be formulated as
follows. There is a binary sequence $x_1 x_2 \ldots x_n$ generated by the Bernoulli
source with $P(x_i = 0) = P(x_i = 1) = 1/2$ and we want to find a block
length $s$ such that the sequence (8) with $m = \lfloor n/s \rfloor$ contains some repetitions
(with high probability). This problem is well known in probability
theory and is sometimes called the birthday problem. Namely, the standard
statement of the problem is as follows. There are $S = 2^s$ cells and $m$ ($= n/s$)
pellets. Each pellet is put in one of the cells with probability $1/S$. It is
known in Probability Theory that, if $m = c\sqrt{S}$, $c > 0$, then the average
number of cells with at least two pellets equals $c^2 (1/2 + o(1))$, where $S$ goes to
$\infty$; see Kolchin, Sevast'yanov and Chistyakov, 1976. In our case the number
of cells with at least two pellets is equal to the number of words from
the sequence (8) which occur two (or more) times. Taking into account
that $S = 2^s$, $m = n/s$, we obtain from $m = c\sqrt{S}$, $c > 0$, an informal rule for
choosing the length of words in (8):

$$n \approx s\, 2^{s/2} \qquad (9)$$

where $n$ is the length of a sample $x_1 x_2 \ldots x_n$ and $s$ is the block length. If $s$ is
much larger, the sequence (8) does not have repeated words (in the case of $H_0$)
and it is difficult to build a sensible test. On the other hand, if $s$ is much
smaller, large classes of alternative hypotheses cannot be tested (due to the
existence of the two-faced processes). It is worth noting that it is impossible
to have a universal choice of $s$, because it is impossible to avoid the
two-faced phenomenon. In other words, this fact can be explained based on
the following known result of Information Theory: it is impossible to have a
guaranteed rate of code convergence universally for all ergodic sources; see
Bailey, 1976, Ryabko, 1984. That is why it is impossible to choose a universal
length $s$. On the other hand, there are many applications where the word
length $s$ can be chosen experimentally. (But, of course, such experiments
should be performed on independent data.)
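
The informal rule (9) can be turned into a small helper for a first guess at the block length. Since (9) is only an approximate equality, the sketch below adopts one possible reading, namely the largest $s$ for which $s\,2^{s/2}$ does not exceed the sample length; this reading and the function name `block_length` are assumptions made for illustration.

```python
def block_length(n_bits: int) -> int:
    """Largest s with s * 2**(s/2) <= n_bits, one reading of the informal rule (9)."""
    s = 1
    while (s + 1) * 2 ** ((s + 1) / 2) <= n_bits:
        s += 1
    return s


if __name__ == "__main__":
    print(block_length(100_000))    # -> 24
    print(block_length(5_000_000))  # -> 34
```

Such a value is only a starting point; the final choice of $s$ would still be tuned experimentally on independent data.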

4 The experiments 

In this part we describe some experiments carried out to compare the new tests
with known ones. We will compare the order test, the book stack test, tests which
are based on standard data compression methods, and tests from Rukhin
and others, 2001. The point is that the tests from Rukhin and others were
selected on the basis of comprehensive theoretical and experimental analysis and
can be considered the state of the art in randomness testing. Besides, we
will also test the method recently published by Ryabko, Stognienko, Shokin
(2004), because it was published later than the book of Rukhin and others.

We used data generated by the PRNG "RANDU" (described in Dudewicz
and Ralley, 1981) and random bits from "The Marsaglia Random Number
CDROM" (see http://stat.fsu.edu/diehard/cdrom/). RANDU is a linear
congruential generator (LCG), which is defined by the following equality

$$X_{n+1} = (A X_n + C) \bmod M,$$

where $X_n$ is the $n$-th generated number. RANDU is defined by the parameters $A =
2^{16} + 3$, $C = 0$, $M = 2^{31}$, $X_0 = 1$. These sources of random data
were chosen because the random bits from "The Marsaglia Random Number
CDROM" are considered good random numbers, whereas it is known that
RANDU is not a good PRNG. It is known that the lowest digits of $X_n$ are
"less random" than the leading digits (Knuth, 1981); that is why in our
experiments with RANDU we extract an eight-bit word from each generated
$X_i$ by the formula $\hat{X}_i = \lfloor X_i / 2^{23} \rfloor$.
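
For readers who want to reproduce the input data, the recurrence and the byte extraction can be sketched as follows. The parameters are those stated above; the function names `randu` and `top_byte` are hypothetical, and this is not the Java program used in the experiments.

```python
from itertools import islice
from typing import Iterator


def randu(seed: int = 1) -> Iterator[int]:
    """Yield X_1, X_2, ... of the LCG with A = 2**16 + 3, C = 0, M = 2**31."""
    a, c, m = 2 ** 16 + 3, 0, 2 ** 31
    x = seed
    while True:
        x = (a * x + c) % m
        yield x


def top_byte(x: int) -> int:
    """Eight leading bits of a 31-bit value: floor(x / 2**23)."""
    return x >> 23


if __name__ == "__main__":
    print(list(islice(randu(1), 3)))  # -> [65539, 393225, 1769499]
    data = bytes(top_byte(x) for x in islice(randu(1), 16))
    print(data.hex())
```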

The behavior of the tests was investigated for files of different lengths
(see the tables below). We generated 100 different files of each length and
applied each of the tests mentioned above to each file with level of significance 0.01
(or less, see below). So, if a test is applied to a truly random bit sequence,
on average 1 file out of 100 should be rejected. All results are given in the
tables, where the integers in boxes are the numbers of rejected files (out of 100). If
the number of rejections is not given for a certain length and test, it means
that the test cannot be applied to files of that length.
Table 1 contains information about testing of sequences of different
lengths generated by RANDU, whereas Table 2 contains the results of applying
all tests to 5 000 000-bit sequences either generated by RANDU or
taken from "The Marsaglia Random Number CDROM". For example, the
first number of the second row of Table 1 is 56. It means that there were
100 files of length $5 \cdot 10^4$ bits generated by the PRNG RANDU. When the
order test was applied, the hypothesis $H_0$ was rejected 56 times out of 100
(and, correspondingly, $H_0$ was accepted 44 times). The first number of the
third row shows that $H_0$ was rejected 42 times when the book stack test was
applied to the same 100 files. The third number of the second row shows that
the hypothesis $H_0$ was rejected 100 times when the order test was applied
to 100 files of 100 000 bits each generated by RANDU, etc.

Let us first comment on the tests that are based on the popular data 
compression methods RAR and ARJ. In those cases we applied each method to a 
file and first estimated the length of the compressed data. Then we used the 
test with the critical value (1) as follows. The alphabet size is 
$|A| = 2^8 = 256$, and $n \log |A|$ is simply the length of the file in bits 
before compression (whereas $n$ is the length in bytes). So, taking 
$\alpha = 0.01$, from (1) we see that the hypothesis of randomness ($H_0$) 
should be rejected if the length of the compressed file is less than or 
equal to $n \log |A| - 8$ bits. (Strictly speaking, in this case 
$\alpha \le 2^{-7} = 1/128$.) Since the length of computer files is measured 
in bytes, the rule is very simple: if an $n$-byte file is really compressed 
(i.e. the length of the encoded file is $n - 1$ bytes or less), the file is 
not random (and $H_0$ is rejected). So, the following tables contain the 
numbers of cases where files were really compressed. 
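
The rule above can be sketched in a few lines of code. This is only an 
illustration under stated assumptions: zlib stands in for RAR and ARJ 
(which were the archivers actually used), and the function names are 
hypothetical. The sketch computes the critical value from (1) for 
$|A| = 256$ and rejects $H_0$ when the compressed length does not exceed it. 

```python
import math
import random
import zlib


def critical_value_bits(n_bytes: int, alpha: float) -> float:
    """t_alpha = n*log2|A| - log2(1/alpha) - 1 with |A| = 256 (8 bits per byte)."""
    return 8 * n_bytes - math.log2(1 / alpha) - 1


def rejects_randomness(data: bytes, alpha: float = 0.01) -> bool:
    """Reject H0 if the compressed length (in bits) is at most t_alpha.

    zlib is an assumption of this sketch; it is not the archiver used in the paper.
    """
    compressed_bits = 8 * len(zlib.compress(data, 9))
    return compressed_bits <= critical_value_bits(len(data), alpha)


# A highly regular file (all zero bytes) is really compressed.
regular = bytes(10_000)
print(rejects_randomness(regular))

# Bytes drawn from Python's built-in generator, used here only as sample input.
rng = random.Random(12345)
sample = bytes(rng.getrandbits(8) for _ in range(10_000))
print(rejects_randomness(sample))
```

Because compressed lengths are whole bytes, the comparison with the critical 
value for $\alpha = 0.01$ reduces to checking whether the output is at most 
$n - 1$ bytes long, as stated in the text. 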

Let us now comment on the parameters of the considered methods. As 
mentioned above, we investigated all methods from the book of Rukhin and 
others (2001), the test of Ryabko, Stognienko and Shokin (2004) (RSS test 
for short), the two tests based on data compression algorithms described 
above, the order test and the book stack test. Some tests have parameters 
that must be specified. In such cases the values of the parameters are given 
in the table in the row that follows the test results. For some tests from 
the book of Rukhin and others, the parameters can be chosen from a certain 
interval. In such cases we repeated all calculations three times, taking the 
minimal possible value of the parameter, the maximal one and the average 
one. The table then gives the data for the case in which the number of 
rejections of the hypothesis $H_0$ is maximal. 

The choice of parameters for RSS, the book stack test and the order test 
was made on the basis of special experiments, which were carried out on 
independent data. (Those algorithms are implemented as a Java program 
and can be found on the internet; see http://web.ict.nsc.ru/~rng/.) In all 
cases such experiments have shown that for all three algorithms the optimal 
blocklength is close to the one defined by the informal equality (9). 

We can see from the tables that the new tests detect non-randomness 
more efficiently than the known ones. Apparently, the main reason is that 
the RSS, book stack and order tests work with a blocklength that is as large 
as possible, whereas many other tests are focused on other goals. The second 
reason could be their ability to adapt: the new tests can find subwords that 
are more frequent than others and use them for testing, whereas many other 
tests look for particular deviations from randomness. 

In conclusion, we can say that the obtained results show that the new 
tests, as well as the ideas of Information Theory in general, can be useful 
tools for randomness testing. 

5 Appendix. 

Proof of Theorem 1. First we estimate the number of codewords 
$\varphi_n(u)$ whose length is less than or equal to an integer $\tau$. 
Obviously, at most one word can be encoded by the empty codeword, at most 
two words by the codewords of length 1, ..., at most $2^i$ words by the 
codewords of length $i$, etc. Taking into account that the codewords 
$\varphi_n(u) \ne \varphi_n(v)$ for different $u$ and $v$, we obtain the 
inequality 

$$ |\{u : |\varphi_n(u)| \le \tau\}| \le \sum_{i=0}^{\tau} 2^i = 2^{\tau+1} - 1 . $$

From this inequality and (1) we can see that the number of words from the 
set $A^n$ whose codelength is less than or equal to 
$t_\alpha = n \log |A| - \log(1/\alpha) - 1$ is not greater than 
$2^{n \log |A| - \log(1/\alpha)}$. So, we obtain that 

$$ |\{u : |\varphi_n(u)| \le t_\alpha\}| \le \alpha |A|^n . $$

Taking into account that all words from $A^n$ have equal probabilities if 
$H_0$ is true, we obtain from the last inequality, (1) and the description 
of the test that 

$$ \Pr\{|\varphi_n(u)| \le t_\alpha\} \le \alpha |A|^n / |A|^n = \alpha $$

if $H_0$ is true. The theorem is proved. 
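
The counting argument in this proof can be checked numerically for small 
parameters. The following is a minimal sketch, assuming the binary alphabet 
($|A| = 2$, so $n \log |A| = n$); the function names are illustrative and do 
not come from the paper. It enumerates all binary codewords of length at 
most $\tau$ and then bounds the fraction of $n$-bit words that any 
one-to-one code could map to codewords of length at most $t_\alpha$. 

```python
import math
from fractions import Fraction
from itertools import product


def count_codewords_up_to(tau: int) -> int:
    """Count all binary words of length 0, 1, ..., tau by explicit enumeration."""
    return sum(1 for length in range(tau + 1) for _ in product("01", repeat=length))


def max_fraction_with_short_codewords(n_bits: int, alpha: float) -> Fraction:
    """Upper bound on the share of n-bit words whose codeword length is <= t_alpha."""
    t_alpha = n_bits - math.log2(1 / alpha) - 1  # critical value from (1), |A| = 2
    tau = math.floor(t_alpha)                    # codeword lengths are integers
    if tau < 0:
        return Fraction(0)
    return Fraction(2 ** (tau + 1) - 1, 2 ** n_bits)


# The enumeration agrees with the closed form 2^(tau+1) - 1.
for tau in range(8):
    assert count_codewords_up_to(tau) == 2 ** (tau + 1) - 1

# The resulting bound never exceeds the chosen significance level.
for alpha in (0.5, 0.1, 0.05, 0.01):
    assert max_fraction_with_short_codewords(20, alpha) <= alpha
```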

Proof of Claim 1. The proof is based on a direct calculation of the 
probability of rejection in the case where $H_0$ is true. From the 
description of the test and the definition of $g$ (see (8)) we obtain the 
following chain of equalities: 

$$ \Pr\{H_0 \text{ is rejected}\} = \Pr\{|\varphi_n(u)| \le g\} + \Pr\{|\varphi_n(u)| = g+1\} \Bigl(1 - \Bigl(\sum_{j=0}^{g+1} |A_j| - \alpha |A|^n\Bigr) / |A_{g+1}|\Bigr) $$

$$ = \frac{1}{|A|^n} \Bigl(\sum_{j=0}^{g} |A_j| + |A_{g+1}| \Bigl(1 - \Bigl(\sum_{j=0}^{g+1} |A_j| - \alpha |A|^n\Bigr) / |A_{g+1}|\Bigr)\Bigr) = \alpha . $$

The claim is proved. 
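
The calculation can be reproduced with exact arithmetic. The sketch below 
rests on assumptions that are not restated in this appendix: $A_j$ is taken 
to be the set of words whose codelength equals $j$, $g$ is taken to be the 
largest integer with $\sum_{j=0}^{g} |A_j| \le \alpha |A|^n$, and a word of 
codelength $g+1$ is accepted with probability 
$(\sum_{j=0}^{g+1} |A_j| - \alpha |A|^n)/|A_{g+1}|$. The function name and 
the example class sizes are hypothetical. 

```python
from fractions import Fraction


def rejection_probability(class_sizes, alpha, total):
    """Probability of rejecting H0 under H0 for the randomized test.

    class_sizes[j] is |A_j|, the number of words with codelength j;
    total is |A|^n; alpha is a Fraction with 0 < alpha < 1.
    """
    target = alpha * total
    cumulative = 0  # |A_0| + ... + |A_g|
    boundary = 0    # |A_{g+1}|
    for size in class_sizes:
        if cumulative + size > target:
            boundary = size  # first class that overshoots alpha * |A|^n
            break
        cumulative += size
    # Words of codelength g + 1 are accepted with this probability.
    accept_on_boundary = (cumulative + boundary - target) / boundary
    always_rejected = Fraction(cumulative, total)
    sometimes_rejected = Fraction(boundary, total) * (1 - accept_on_boundary)
    return always_rejected + sometimes_rejected


# 16 words of length n = 4 over {0, 1}; sizes of the codelength classes 0..4.
sizes = [0, 2, 4, 8, 2]
print(rejection_probability(sizes, Fraction(1, 4), 16))
print(rejection_probability(sizes, Fraction(1, 100), 16))
```

Under these assumptions the randomization on the boundary class makes the 
probability of a Type I error equal to $\alpha$ exactly, rather than only 
bounded by it. 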

Proof of Claim 2. We can assume that $t_\alpha$ in (4) is an integer. 
(Otherwise, we obtain the same test by taking $\lfloor t_\alpha \rfloor$ as 
a new critical value of the test.) From the Kraft inequality (3) we obtain 
that 

$$ 1 \ge \sum_{u \in A^n} 2^{-|\varphi_n(u)|} \ge |\{u : |\varphi_n(u)| \le t_\alpha\}| \, 2^{-t_\alpha} . $$

This inequality and (4) yield: 

$$ |\{u : |\varphi_n(u)| \le t_\alpha\}| \le \alpha |A|^n . $$

If $H_0$ is true then the probability of each $u \in A^n$ equals 
$|A|^{-n}$, and from the last inequality we obtain that 

$$ \Pr\{|\varphi_n(u)| \le t_\alpha\} = |A|^{-n} \, |\{u : |\varphi_n(u)| \le t_\alpha\}| \le \alpha , $$

if $H_0$ is true. The claim is proved. 
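
A small numerical instance may make the use of the Kraft inequality 
concrete. The sketch assumes that the critical value in (4) has the form 
$t_\alpha = n \log |A| - \log(1/\alpha)$, which is what the step 
"this inequality and (4) yield" requires; the codeword lengths below are a 
hypothetical example of a code for the 8 words of $\{0, 1\}^3$, and the 
function names are illustrative. 

```python
import math
from fractions import Fraction


def kraft_sum(lengths):
    """Sum of 2^(-length) over all codewords; at most 1 for a decodable code."""
    return sum(Fraction(1, 2 ** length) for length in lengths)


def type_one_error(lengths, alpha):
    """Pr{codelength <= t_alpha} when all words are equally likely (H0)."""
    total = len(lengths)  # |A|^n
    t_alpha = math.log2(total) - math.log2(1 / alpha)  # assumed form of (4)
    rejected = sum(1 for length in lengths if length <= t_alpha)
    return Fraction(rejected, total)


# Hypothetical codeword lengths for the 8 binary words of length 3.
lengths = [1, 3, 3, 4, 4, 4, 5, 5]
print(kraft_sum(lengths))
print(type_one_error(lengths, 0.25))
```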

Proof of Theorem 3. We prove the theorem for the process $T(k, \pi)$, but 
the proof is valid for $\bar{T}(k, \pi)$, too. First we show that 

$$ p^*(x_1 \ldots x_d) = 2^{-d} , \qquad (10) $$

$(x_1 \ldots x_d) \in \{0, 1\}^d$, $d = 1, \ldots, k$, is a stationary 
distribution for the processes $T(k, \pi)$ (and $\bar{T}(k, \pi)$) for all 
$k = 1, 2, \ldots$ and $\pi \in (0, 1)$. For any value of $k$, $k \ge 1$, 
(10) will be proved if we show that the system of equations 

$$ P_{T(k,\pi)}(x_1 \ldots x_d) = P_{T(k,\pi)}(0 x_1 \ldots x_{d-1}) \, P_{T(k,\pi)}(x_d / 0 x_1 \ldots x_{d-1}) + P_{T(k,\pi)}(1 x_1 \ldots x_{d-1}) \, P_{T(k,\pi)}(x_d / 1 x_1 \ldots x_{d-1}) $$

has the solution $p(x_1 \ldots x_d) = 2^{-d}$, 
$(x_1 \ldots x_d) \in \{0, 1\}^d$, $d = 1, 2, \ldots, k$. This can be easily 
seen for $d = k$ if we take into account that, by the definition of 
$T(k, \pi)$ and $\bar{T}(k, \pi)$, the equality 
$P_{T(k,\pi)}(x_k / 0 x_1 \ldots x_{k-1}) + P_{T(k,\pi)}(x_k / 1 x_1 \ldots x_{k-1}) = 1$ 
is valid for all $(x_1 \ldots x_k) \in \{0, 1\}^k$. From this equality and 
the law of total probability we immediately obtain (10) for $d < k$. 

Let us prove the second claim of the theorem. From the definition of 
$T(k, \pi)$ and $\bar{T}(k, \pi)$ we can see that either 
$P_{T(k,\pi)}(0 / x_1 \ldots x_k) = \pi$, 
$P_{T(k,\pi)}(1 / x_1 \ldots x_k) = 1 - \pi$ or 
$P_{T(k,\pi)}(0 / x_1 \ldots x_k) = 1 - \pi$, 
$P_{T(k,\pi)}(1 / x_1 \ldots x_k) = \pi$. That is why 
$h(x_{k+1} / x_1 \ldots x_k) = -(\pi \log_2 \pi + (1 - \pi) \log_2 (1 - \pi))$ 
and, hence, $h_\infty = -(\pi \log_2 \pi + (1 - \pi) \log_2 (1 - \pi))$. 
The theorem is proved. 
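
For orientation, the limit entropy obtained above can be evaluated 
numerically. The helper below is only an illustration (its name is not from 
the paper). It shows that the value equals 1 bit per letter for 
$\pi = 1/2$ and is smaller for any other $\pi \in (0, 1)$, even though by 
(10) every block of length up to $k$ is uniformly distributed. 

```python
import math


def binary_entropy(pi: float) -> float:
    """h = -(pi*log2(pi) + (1 - pi)*log2(1 - pi)) for 0 < pi < 1."""
    return -(pi * math.log2(pi) + (1 - pi) * math.log2(1 - pi))


for pi in (0.5, 0.4, 0.25, 0.1):
    print(pi, round(binary_entropy(pi), 4))
```
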
Table 1: Number of files generated by the PRNG RANDU and recognized as 
non-random, for different tests and different file lengths (in bits). 

| Name of test / Length of file | $5 \cdot 10^4$ | $10^5$ | $5 \cdot 10^5$ | $10^6$ |
|-------------------------------|----------------|--------|----------------|--------|
| Order test                    | 56             | 100    | 100            | 100    |
| Book stack                    | 42             | 100    | 100            | 100    |
| parameters for both tests     | $s=20$, $\lvert A_1 \rvert = 5\sqrt{2^s}$ (all lengths) | | | |
| RSS                           | 4              | 75     | 100            | 100    |
| parameters                    | s=16           | s=17   | s=20           | s=20   |
| RAR                           |                |        | 100            | 100    |
| ARJ                           |                |        | 99             | 100    |
| Frequency                     | 2              | 1      | 1              | 2      |
| Block Frequency               | 1              | 2      | 1              | 1      |
| parameters                    | M=1000         | M=2000 | M=$10^5$       | M=20000 |
| Cumulative Sums               | 2              | 1      | 2              | 1      |
| Runs                          |                | 2      | 1              | 1      |
| Longest Run of Ones           |                | 1      |                |        |
| Rank                          |                | 1      | 1              |        |
| Discrete Fourier Transform    |                |        |                | 1      |
| NonOverlapping Templates      | -              | -      | -              | 2      |
| parameters                    |                |        |                | m=10   |
| Overlapping Templates         |                |        |                | 2      |
| parameters                    |                |        |                | m=10   |
| Universal Statistical         |                |        | 1              | 1      |
| parameters                    |                |        | L=6, Q=640     | L=7, Q=1280 |
| Approximate Entropy           | 1              | 2      | 2              | 7      |
| parameters                    | m=5            | m=11   | m=13           | m=14   |
| Random Excursions             |                |        |                | 2      |
| Random Excursions Variant     |                |        |                | 2      |
| Serial                        |                | 1      | 2              | 2      |
| parameters                    | m=6            | m=14   | m=16           | m=8    |
| Lempel-Ziv Complexity         |                |        |                | 1      |
| Linear Complexity             |                |        |                | 3      |
| parameters                    |                |        |                | M=2500 |

Table 2: Number of 5 000 000-bit files, either generated by the PRNG RANDU 
or random, that are recognized as non-random. 

| Name of test / Kind of file | RANDU        | random       |
|-----------------------------|--------------|--------------|
| Order test                  | 100          | 3            |
| Book stack                  | 100          |              |
| parameters for both tests   | $s=24$, $\lvert A_1 \rvert = 5\sqrt{2^s}$ (both kinds of file) | |
| RSS                         | 100          | 1            |
| parameters                  | s=24         | s=24         |
| RAR                         | 100          | 0            |
| ARJ                         | 100          | 0            |
| Frequency                   | [illegible]  | 1            |
| Block Frequency             | 2            | 1            |
| parameters                  | M = 10       | M = 10       |
| Cumulative Sums             | [illegible]  | 2            |
| Runs                        | 2            | 2            |
| Longest Run of Ones         | 2            |              |
| Rank                        | 1            | 1            |
| Discrete Fourier Transform  | 89           | 9            |
| NonOverlapping Templates    | 5            | 5            |
| parameters                  | m=10         | m=10         |
| Overlapping Templates       | 4            | 1            |
| parameters                  | m=10         | m=10         |
| Universal Statistical       | 1            | 2            |
| parameters                  | L=9, Q=5120  | L=9, Q=5120  |
| Approximate Entropy         | 100          | 89           |
| parameters                  | m=17         | m=17         |
| Random Excursions           | 4            | 3            |
| Random Excursions Variant   | 3            | 3            |
| Serial                      | 100          | 2            |
| parameters                  | m=19         | m=19         |
| Lempel-Ziv Complexity       |              |              |
| Linear Complexity           | 4            | 3            |
| parameters                  | M=5000       | M=2500       |